ABSTRACT

We consider a serious, previously-unexplored challenge facing al- 
most all approaches to scaling up entity resolution (ER) to multi- 
ple data sources: the prohibitive cost of labeling training data for 
supervised learning of similarity scores for each pair of sources. 
While there exists a rich literature describing almost all aspects 
of pairwise ER, this new challenge is arising now due to the un- 
precedented ability to acquire and store data from online sources, 
features driven by ER such as enriched search verticals, and the 
uniqueness of noisy and missing data characteristics for each source. 
We show on real-world and synthetic data that for state-of-the-art
techniques, the reality of heterogeneous sources means that the
amount of labeled training data must scale quadratically in the
number of sources, just to maintain constant precision/recall. We
address this challenge with a brand new transfer learning algorithm 
which requires far less training data (or equivalently, achieves supe- 
rior accuracy with the same data) and is trained using fast convex 
optimization. The intuition behind our approach is to adaptively 
share structure learned about one scoring problem with all other 
scoring problems sharing a data source in common. We demon- 
strate that our theoretically motivated approach incurs no runtime 
cost while it can maintain constant precision/recall with the cost of 
labeling increasing only linearly with the number of sources. 

Categories and Subject Descriptors 

H.2 [Information Systems]: Database Management; I.2.6 [Artificial
Intelligence]: Learning; I.5.4 [Pattern Recognition]: Applications

General Terms 

Algorithms, Experimentation 

Keywords 

Entity resolution, deduplication, record linkage, data integration, 
transfer learning, multi-task learning, convex optimization 
1. INTRODUCTION

In this paper we investigate a serious and previously-unexplored 
challenge to scaling joint entity resolution (ER) to multiple sources: 
that of intractable labeling costs required to model heterogeneities 
in real-world data sources. 

Significant attention has already been focused on ER in the DB,
data mining and statistics communities, where the typically-stated
goals are computational performance (good runtime) and statistical
performance (good precision/recall); cf. e.g., [13, 18] and
references therein for general discussions on ER. The most common
approach for achieving good precision/recall is to employ
supervised learning to combine domain-expert-selected feature scores
into overall similarity scores [17, 26, 28, 32, 35]. Indeed,
in a recent, comprehensive evaluation of over 20 state-of-the-art
ER systems [19], Köpcke et al. found that on most tasks supervised
learning-based matchers offer superior performance.

However, Köpcke et al. also noted that statistical performance
comes at the price of human effort in labeling training examples,
and explicitly highlighted labeling cost as a key measure of matcher
performance. But while there have been studies on multiple-source
ER, and there are numerous applications in science, technology
and medicine motivating effective approaches for ER over multiple
sources [16, 20, 31], we are the first to note that state-of-the-art
ER approaches have intractable labeling cost on multiple sources.
Indeed, to maintain constant precision/recall, we show that existing
approaches suffer labeling costs that scale quadratically as the
number of sources increases.¹ The need for learning individual
score functions when faced with data heterogeneity has been
acknowledged previously, both explicitly [29] and implicitly [17]
(cf. related work in Section 6 for more); however, we are the first to
comprehensively quantify this requirement. Finally, just as
computational scaling can be tackled via cloud computing, one may look
to human computation (e.g., via Amazon Mechanical Turk) to address
the labeling cost challenge. However, very many ER problems
involve integrating highly privacy-sensitive, or trade-secret, data that
cannot be outsourced.

Our negative results on state-of-the-art approaches would sug- 
gest an impossible trade-off between precision/recall and labeling 
cost when performing ER on even a moderate number of real- 
world, heterogeneous data sources. To address this problem, we 
develop a brand new transfer learning algorithm that jointly learns 
to score pairs of data sources while adaptively sharing common 
patterns of data quality. Training our algorithm TRANSFER
involves solving a convex optimization program via fast
state-of-the-art composite gradient methods [24]. Motivated by a
multiple-source ER problem for the movies vertical in a major Internet search

1 We focus on the more general pairwise matching problem as
opposed to easier matching of multiple sources to a single master.
Figure 1: A typical ER data flow. This paper focuses on the learning step, under multiple data sources, that infers the similarity of
each pair of entities from attribute similarities, based on human-labeled examples of matching and non-matching record pairs. 



engine, we demonstrate both on a large real-world movie entity
crawl dataset (with sources 10x larger than any considered in [19])
and a large-scale synthetic dataset, that our TRANSFER algorithm
is superior to state-of-the-art approaches while incurring
a labeling cost that is only linear in the number of sources being
resolved. While this constitutes a major contribution to entity
resolution, TRANSFER is also of independent interest as a novel
contribution to machine learning research as it leverages a
previously-unseen pairwise structure between learning tasks that is
motivated directly by the application to ER.²

Organization. In Section 2 we present a precise problem statement
and elaborate on our running movie matching example. We
then develop the TRANSFER learning algorithm for low-labeling-cost
multiple-source ER in Section 3. Sections 4 and 5 present
thorough experimental evaluations on both real-world and synthetic
data. Finally we discuss related work in Section 6 and conclude
with directions for future work in Section 7.

Notation. On vectors $v \in \mathbb{R}^p$, we let the $\ell_q$ norm for $q \geq 1$
be defined as $\|v\|_q := \left(\sum_{i=1}^{p} |v_i|^q\right)^{1/q}$, and $\|v\|_\infty := \max_i |v_i|$.
We let $\mathrm{sign}(v)$ denote the vector of the same dimensions whose $i$th
element is the sign of $v_i$, that is $v_i/|v_i|$ if $v_i \neq 0$, and is equal to zero
otherwise.

2. PROBLEM STATEMENT 

We now formalize our problem, which is to produce functions
that combine $p$ similarity feature scores $g(e_i, e_j) \in \mathbb{R}^p$ between
two entities $e_i$ and $e_j$ taken from their respective sources $S_i$ and
$S_j$. As is common in ER, the feature scores are typically chosen by
a domain expert; the output of the combination represents an overall
similarity score between the entities that should achieve strong
precision and recall. We consider $r > 2$ sources, and so $i \neq j$
will be taken from $\{1, \ldots, r\}$ unless stated otherwise. As we shall
demonstrate empirically, automatically learning the combination of
feature scores typically requires prohibitively large amounts of
labeled training data for large $r$.

DEFINITION 1. The formal goal of the Multi-Source Similarity
Learning Problem is: for each pair of sources $S_i$, $S_j$, learn a
similarity scoring function $f_{ij}$ mapping feature space attributes
$g(e_i, e_j) \in \mathbb{R}^p$ to a real-valued score. Negative (non-negative)
scores are interpreted as predictions by $f_{ij}$ that a pair of entities is
non-matching (matching), and the magnitude of the scores
corresponds to a measure of confidence in the predictions. We desire to
learn $f_{ij}$ that achieve strong precision and recall using few labeled
examples.

2 Existing transfer learning approaches suit only the simpler 
multiple-source ER problem of all-against-master (as opposed to 
pairwise). TRANSFER formally subsumes such approaches. 



Figure 1 depicts a typical ER system producing scores which can
be fed into subsequent merge or clustering functions [18]; resulting
scores are typically thresholded to produce the resolution. Prior to 
feature scoring and score combination, the entities are normalized 
in a pre-processing step (e.g., in movie matching, producing clean 
movie titles, cast, directors, release years and runtimes); and then 
blocking is employed to prune the pairs of entities considered in 
scoring, via a linear pass hashing entities to blocks (e.g., movie en- 
tities are hashed to their non-stop-title-words so that only movies 
with a rare word in common are scored). In the movie matching ex- 
ample, feature scoring may produce title edit distance, year & run- 
time absolute difference, and Jaccard coefficients for cast and direc- 
tors; then score combination computes overall scores after learning 
how to do so from a human-labeled set of matching/non-matching 
entity pairs. 
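
The blocking and feature-scoring steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system used in the paper: the record fields (`title`, `year`, `runtime`, `cast`), the stop-word list, the helper names and the toy records are all hypothetical, blocking here uses any shared non-stop title word rather than only rare ones, and a Jaccard coefficient on title words stands in for title edit distance to keep the example short.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and"}  # hypothetical stop-word list


def title_words(title):
    """Return the lower-cased non-stop words of a title (toy normalization)."""
    return {w for w in title.lower().split() if w not in STOP_WORDS}


def candidate_pairs(source_a, source_b):
    """Blocking: yield record pairs sharing at least one non-stop title word."""
    blocks = defaultdict(list)
    for idx_b, rec_b in enumerate(source_b):  # linear pass hashing records to blocks
        for word in title_words(rec_b["title"]):
            blocks[word].append(idx_b)
    for rec_a in source_a:
        matched = set()
        for word in title_words(rec_a["title"]):
            matched.update(blocks[word])
        for idx_b in sorted(matched):
            yield rec_a, source_b[idx_b]


def jaccard(x, y):
    """Jaccard coefficient of two collections, treated as sets."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def feature_scores(e_i, e_j):
    """Feature score vector g(e_i, e_j) for one candidate pair."""
    return [
        jaccard(title_words(e_i["title"]), title_words(e_j["title"])),
        abs(e_i["year"] - e_j["year"]),
        abs(e_i["runtime"] - e_j["runtime"]),
        jaccard(e_i["cast"], e_j["cast"]),
    ]


# Toy records, for illustration only.
source_a = [{"title": "The Matrix", "year": 1999, "runtime": 136,
             "cast": ["Keanu Reeves", "Carrie-Anne Moss"]}]
source_b = [{"title": "Matrix", "year": 1999, "runtime": 135,
             "cast": ["Keanu Reeves"]},
            {"title": "Heat", "year": 1995, "runtime": 170,
             "cast": ["Al Pacino"]}]

pairs = list(candidate_pairs(source_a, source_b))
scores = [feature_scores(a, b) for a, b in pairs]
```

Only the pair sharing the word "matrix" survives blocking, so a single feature vector is produced; the learning step then has to combine such vectors into one overall score.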

3. TRANSFER LEARNING ALGORITHM 

The primary goal of machine learning approaches is statistical 
efficiency, formalized by the notion of a learning algorithm's sam- 
ple complexity: the amount of training data required for a desired 
accuracy (with high confidence). The transfer learning paradigm
[2, 14, 15, 21] has enjoyed recent interest in the machine learning and
statistics communities, due to its general principle of exploiting in- 
formation gleaned in multiple related learning tasks to reduce the 
tasks' sample complexities. This section develops a new transfer 
learning algorithm for the Multi-Source Similarity Learning Prob- 
lem. As well as contributing a solution to the seemingly intractable 
labeling cost of performing ER over multiple sources, our algo- 
rithm TRANSFER represents a contribution to machine learning re- 
search as it presents an approach to a novel transfer learning prob- 
lem with a unique inter-task structure. 

We now briefly overview the intuition behind transfer learning 
approaches in general. In one setting of transfer learning we may 
consider the problem of first performing one learning task (or sev- 
eral) and then using the obtained information to make a new learn- 
ing task more efficient. In another setting, we have multiple tasks 
that we wish to learn from simultaneously in the hopes that jointly 
learning all models will result in a net decrease in sample
complexity. A common characteristic that each of these settings shares
is that we wish to learn statistically independent tasks. Each of
these tasks represents a separate problem that shares some common
structure: either shared support [14, 21] or shared subspaces [2].
Figure 2 depicts the intuition of how transfer learning improves
accuracy via the latter approach. This figure represents the setting in
which the learning tasks' inferred models (here vectors) should be 
highly clustered. This common structure allows us to learn from the 
available information more effectively by considering the problems 
Figure 2: Learning can be viewed as approximating a true concept
(solid vectors) with a model (dotted vectors) taken from a
class of models (grey ball). Left, three learning tasks (in red, 
blue & pink) are typically performed independently. Task data 
is not pooled, and a large class of models may be consistent 
with the observed data. Right, transfer learning jointly learns 
a model class common to the tasks thereby implicitly sharing 
task data and effectively reducing the model class complexity. 



jointly rather than separately. 

In all forms of learning, the chosen classifier is taken from some 
subspace of models that depends on parameters to the learning al- 
gorithm, and the training data. With more data, the class of mod- 
els can be tightened to yield a more precise classifier. In low-data 
settings, this is difficult to achieve. Transfer learning approaches 
jointly learn the model class common to many learning tasks, while 
learning each individual task's classifier. In so doing, transfer
learning is able to learn accurately with less data. A key element to
designing a successful transfer learning scheme is to appropriately
constrain the structure of the model class to reflect the shared
properties of the tasks' true classifiers. For example, the shared
structure in Figure 2 is depicted as the true classifiers sitting in a small
Euclidean ball.

A challenge that arises in our setting of tasks corresponding to
learning source-pair similarity functions $\{f_{ij}\}$ is in handling the
interactions between the sources $\{S_i\}$. For example, a standard
transfer learning approach to learning a scoring function between
sources A and B and between A and C would be to treat these
two tasks just as it would a third task for sources D and E, ignoring
the fact that some tasks share a common source: here A. A new model would
allow us to more accurately learn a scoring function across pairs of
sources for which no available training examples exist.

3.1 Learning Models 

We begin to derive our new transfer learning algorithm for ER by 
expressing the class of models our algorithm will learn over. For 
reasons made clear below, we design TRANSFER to learn linear 
classifiers; however we later compare this approach against state- 
of-the-art non-linear algorithms, and we note that the techniques 
described here are general and can be kernelized to produce non- 
linear analogues. 

3.1.1 Linear Classifiers in ER 

Specifying an appropriate model allows us to avoid overfitting 
the training data. However, as the complexity of our model in- 
creases so too does the number of training examples — the sample 
complexity — required to fit all of the available "degrees of free- 
dom" of our model. Hence, we will need to take the amount of 
available training data and the learning task at hand into considera- 
tion when specifying our model. 

In ER [9, 17, 25, 28, 35], and many other problems [20, 30], it
has been shown that linear models perform exceptionally well for
explaining the behavior between feature score vectors $x$ and output
labels $y$. The choice of a linear model serves a dual statistical
and computational purpose. Linear models can be evaluated
very quickly and are also inexpensive to store, requiring only $p + 1$
doubles, which together make the model ideal for large-scale learning.
From a statistical perspective, given enough features, we can
accurately model the interactions in our data. Formally, we assume that
a given input set of features $x$ and an output label $y$ can be related
by

$$y = \mathrm{sign}(\langle w, x \rangle + b) ,$$

where $b$ is a bias term capturing the fact that the model is not exact
due to noise. Here $w$ acts as a weight vector, placing varying
importance on each of the feature scores in $x$. The setting of $w$ results in
splitting our feature score space into two half-spaces, since we have
two classes. Finally, we will take our similarity scoring function for
a given source pair $(i, j)$ to be³

$$f_{ij}(x) = \langle w_{ij}, x \rangle .$$

Hence the Multi-Source Similarity Learning Problem corresponds 
to inferring the weight vectors Wij . 

We will compare the transfer learning approach based on linear
models of this section, with both linear and non-linear
state-of-the-art baselines in Sections 4 and 5. There is a trade-off: on the one
hand, more features allow us to model more interactions; on the
other hand, more features result in a more complex model that can
be susceptible to overfitting, to the detriment of sample complexity,
i.e., labeling cost.

3.1.2 Transfer Learning Model

While we have a number of separate tasks across different pairs 
of sources — naively leading to a quadratic scaling of the sample 
complexity with the number of sources r — one training example 
from source pair (i, j) could inform learning to score another source 
pair (j, h) and conceivably even (h, k). This intuition motivates 
our interest in applying transfer learning to uncover shared charac- 
teristics between the different source pair tasks. As borne out in our 
experiments, doing so will effectively allow us to share examples 
across many different source pairs in order to most efficiently use 
the available resources and successfully scale to resolving multiple 
heterogeneous data sources. 

Two extreme forms of transfer are in common use in practice
today: either learning each classifier $f_{ij}$ separately (so as to model
heterogeneities in the sources at a great labeling cost), or pooling
all available data and learning a single classifier (mitigating the
labeling cost at the expense of flexibility). These existing approaches
correspond to no transfer and complete transfer, respectively.
An ideal method should behave between both extremes
and allow the data to dictate the most appropriate behavior.
When there is very limited data, we may not have enough information
to describe the difference in characteristics between sources.
As we gain more information, our method should adapt and take
into account any added information. To that end, we introduce a
method that we call TRANSFER. For this model, we assume that our
weight vectors decompose as

$$w_{ij} = w_0 + w_i + \Delta_{ij} \qquad (1)$$

where the vector $w_0$ captures the general trends; for example, movies
with the same casts are generally going to be similar. The weight
vector $w_i$ accounts for the specific effects induced by the particular
source, and the vector $\Delta_{ij}$ handles the pairwise deviations and can
also be applied to guarantee that $w_{ij} = w_{ji}$.
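
As a small illustration of this decomposition, the sketch below assembles a pairwise weight vector from a shared vector, a per-source vector and a pairwise deviation, and scores a feature vector with it. The function names and numeric values are hypothetical. The symmetric deviation used here, half the difference of the two source vectors, is one choice that makes the weights for (i, j) and (j, i) agree; it is assumed for illustration at this point.

```python
def pair_weights(w0, w_i, delta_ij):
    """Equation (1): w_ij = w0 + w_i + Delta_ij, element-wise."""
    return [a + b + c for a, b, c in zip(w0, w_i, delta_ij)]


def score(w_ij, x):
    """Linear similarity score f_ij(x) = <w_ij, x>."""
    return sum(w * v for w, v in zip(w_ij, x))


# Toy weights for p = 2 features and two sources (illustrative values only).
w0 = [1.0, 0.5]
w = {1: [0.0, 0.25], 2: [0.5, 0.0]}

# One symmetric choice of deviations: Delta_ij = (w_j - w_i) / 2.
delta = {
    (1, 2): [0.5 * (b - a) for a, b in zip(w[1], w[2])],
    (2, 1): [0.5 * (a - b) for a, b in zip(w[1], w[2])],
}

w_12 = pair_weights(w0, w[1], delta[(1, 2)])
w_21 = pair_weights(w0, w[2], delta[(2, 1)])
```

With these values both orderings of the source pair yield the same weight vector, so the score of a record pair does not depend on which source is listed first.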

3 We drop the term b for convenience, without loss of generality. 



3.2 Regularized Learning Formulation 

We now formulate an optimization program for learning the
underlying pairwise score functions. There has been a flurry of
research in developing efficient techniques for finding parameters
that can accurately describe the data using models such as those
described above. A number of techniques are based on optimizing
a convex function for efficiently recovering the parameters. Such
convex programs have seen tremendous theoretical and experimental
success in the literature [7, 21, 34].

Before proceeding, we recall that the sources are indexed by an
integer in $\{1, \ldots, r\}$ so that (abusing notation) $S \in \{1, \ldots, r\}$.
Furthermore, we will let $(i(k), j(k))$ denote the source pair that the
$k$th example was drawn from. Given that, we write our $k$th training
example as $(x_k, S_{1,k}, S_{2,k}, y_k)$, where $x_k = g(e_{1,k}, e_{2,k})$ denotes
the feature vector representation of the pair of entities $(e_{1,k}, e_{2,k})$,
$S_{1,k}$ and $S_{2,k}$ represent the source indices of the entities, and $y_k$
represents the true label. With this notation, we may propose to
learn an Equation (1) model that solves the convex program

$$\operatorname*{argmin}_{w_0,\, w_i,\, \Delta_{ij}} \;
\underbrace{\frac{1}{n} \sum_{k=1}^{n} \left( y_k - \langle w_0 + w_{i(k)} + \Delta_{i(k)j(k)},\, x_k \rangle \right)^2}_{\text{empirical risk term}}
+ \underbrace{\lambda \sum_{i=1}^{r} \|w_i\|_1 + \frac{\mu}{2} \sum_{i,j} \|\Delta_{ij}\|_2^2}_{\text{regularization terms}}$$

$$\text{s.t.} \quad w_0 + w_i + \Delta_{ij} = w_0 + w_j + \Delta_{ji} \qquad (2)$$

The results of this program are estimates of our weight vectors $w_0$,
$w_i$, and $\Delta_{ij}$. We now take a moment to discuss Program (2). The
objective function can be decoupled into two components: an
empirical risk or loss term and a regularization term.

Loss Term. The loss term aims to encourage predictions on the
training input feature vectors to match the training labels.
Furthermore, we note that our assumption on the form of the pairwise
score functions is built into the optimization procedure. That is,
$\langle w_0 + w_i + \Delta_{ij}, x_k \rangle$ should be close to $y_k$. We note that there are
alternative options for the loss term such as those used in the
logistic regression or support vector machine linear models, both
of which are also used extensively in the literature.

Regularization Terms. While the empirical loss term encourages
our parameters to closely fit the model, the regularization
terms exist in order to penalize overly-complex models and avoid
overfitting. For the regularization we penalize the source weight
vectors $w_i$ by the $\ell_1$ norm and the pairwise weight vectors $\Delta_{ij}$ by
the $\ell_2$ norm squared. These choices have both been extensively
studied in the literature due to a number of desirable consequences
that they each have. It has been shown that the $\ell_1$ norm encourages
solutions to convex optimization procedures to be sparse (the $\ell_1$
norm essentially acts as a convex surrogate to the $\ell_0$ norm, or the
total number of non-zero parameters in a vector). Authors have
established both theoretical and experimental results
demonstrating the performance of the $\ell_1$ norm [7, 11, 33]. By encouraging
the $w_i$ to be sparse, we capture the fact that each source (for
the most part) should behave as the nominal source represented
by $w_0$. This type of assumption has also appeared in the context
of robust regression [15] and low-rank sparse matrix decompositions
[8]. The $\ell_2$ norm squared term acts to restrict the size of $\Delta_{ij}$
(without necessarily requiring sparsity), allowing pairwise
perturbations away from the nominal behavior between two sources but
avoiding overfitting [27]. This choice also accounts for the fact that
in general $w_0 + w_i$ will not equal $w_0 + w_j$.
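
To make the roles of the loss and regularization terms concrete, the following sketch evaluates an objective of this shape for given parameters. It is an illustration under assumptions rather than the authors' implementation: the function and argument names are hypothetical, `delta` stands for the pairwise deviation vectors, `lam` and `mu` for the two regularization parameters, and the constants (a 1/n factor on the squared loss and a 1/2 factor on the squared norm) are assumed.

```python
def objective(examples, w0, w_src, delta, lam, mu):
    """Squared-loss risk plus l1 penalty on source vectors and squared l2 on deviations.

    examples: list of (x, i, j, y) with feature vector x, source indices i, j, label y.
    w_src:    dict mapping a source index to its weight vector w_i.
    delta:    dict mapping a source pair (i, j) to its deviation vector.
    """
    n = len(examples)
    risk = 0.0
    for x, i, j, y in examples:
        w_ij = [a + b + c for a, b, c in zip(w0, w_src[i], delta[(i, j)])]
        pred = sum(w * v for w, v in zip(w_ij, x))
        risk += (y - pred) ** 2
    risk /= n  # assumed 1/n normalization
    l1_penalty = lam * sum(abs(v) for w in w_src.values() for v in w)
    l2_penalty = 0.5 * mu * sum(v * v for d in delta.values() for v in d)
    return risk + l1_penalty + l2_penalty


# Toy inputs (illustrative values only).
examples = [([1.0, 0.0], 1, 2, 1.0)]
value = objective(
    examples,
    w0=[1.0, 0.0],
    w_src={1: [0.5, 0.0], 2: [0.0, 0.0]},
    delta={(1, 2): [0.0, 1.0]},
    lam=2.0,
    mu=4.0,
)
```

In this toy case the risk is 0.25, the sparsity penalty is 1.0 and the deviation penalty is 2.0, which shows how a large deviation weight quickly dominates the objective.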



Parameter Selection. We may tune the parameters $\lambda$ and $\mu$ to
achieve various levels of model complexity and control the amount
of transfer. These parameters can be selected via extending existing
theoretical results in the literature [3] or based on a user's prior
knowledge for the problem. Another popular method (adopted in
this paper) is to apply cross validation, and use a hold-out set of the
data to select the parameters [12].

Extending the Learning Algorithm. Our construction allows 
for a number of choices for the empirical risk and regularization 
functionals, and we found that our current choices worked well 
practically from a statistical perspective as well as a computational 
one. It would be a relatively trivial task to modify our optimization 
to be more like a pairwise transfer learning version of other lin- 
ear model-based learners such as logistic regression or SVMs. Our 
contribution is a generic transfer learning approach for ER which 
encompasses a family of algorithms; one of which we focus on here 
as a first study on using transfer in multiple-source ER. 

3.3 The Algorithm 

We now proceed to derive methods for solving Program (2).

3.3.1 Optimality Conditions 

The following result derives from applying the Karush-Kuhn-Tucker
(KKT) conditions that govern the program's optimality
conditions [6].

LEMMA 2. Suppose that we are given optimal solutions to the
convex Program (2): $w_0^*$, $w_i^*$, and $\Delta_{ij}^*$, and that we let

$$X^{ij} := \sum_{\{k \,|\, i(k)=i,\, j(k)=j\}} x_k x_k^T , \quad \text{and} \quad
g^{ij} := \sum_{\{k \,|\, i(k)=i,\, j(k)=j\}} x_k y_k .$$

Then an application of the KKT conditions yields that

$$\Delta_{ij}^* = (X^{ij} + 2\mu I)^{-1} \left( g^{ij} - X^{ij} w_0^* + \mu w_j^* - (X^{ij} + \mu I) w_i^* \right) .$$

We would like $\Delta_{ij}^*$ to be as small as possible, which would equate
to setting $\mu$ as large as possible. Therefore, letting $\mu$ go to $\infty$, we
have that

$$\Delta_{ij}^* = \frac{1}{2} (w_j^* - w_i^*) .$$

Therefore, we observe that under $\mu \to \infty$, each $\Delta_{ij}^*$ is simply half
the difference between the source vectors $w_i^*$ and $w_j^*$. Thus, we
immediately obtain

$$w_0^* + w_i^* + \Delta_{ij}^* = w_0^* + \frac{1}{2} (w_i^* + w_j^*) .$$

The above setting of $\mu$, and subsequent choice of $\Delta_{ij}^*$, is used
throughout the remainder of the paper. This model is one that lends itself
to far simpler computation since we are only required to compute
$w_i$ for $0 \leq i \leq r$. Furthermore, the model still captures interesting
characteristics of each of the sources while learning the common
characteristics shared across all sources.
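
To see where the limiting form comes from, it may help to spell out the intermediate step. The following is a sketch of the argument rather than a full proof: the pair superscripts on $X$ and $g$ are dropped for brevity, $\mu$ denotes the weight on the squared norm of the pairwise deviations $\Delta_{ij}$, and the optimal vectors are treated as remaining bounded as $\mu$ grows, which is an assumption made here.

```latex
\begin{align*}
\Delta^*_{ij}
  &= (X + 2\mu I)^{-1}\bigl(g - X w_0^* + \mu w_j^* - (X + \mu I) w_i^*\bigr) \\
  &= \Bigl(\tfrac{1}{\mu} X + 2 I\Bigr)^{-1}
     \Bigl(\tfrac{1}{\mu}\bigl(g - X w_0^* - X w_i^*\bigr) + w_j^* - w_i^*\Bigr) \\
  &\xrightarrow{\;\mu \to \infty\;} (2I)^{-1}\bigl(w_j^* - w_i^*\bigr)
   = \tfrac{1}{2}\bigl(w_j^* - w_i^*\bigr), \\
w_i^* + \Delta^*_{ij}
  &= w_i^* + \tfrac{1}{2}\bigl(w_j^* - w_i^*\bigr)
   = \tfrac{1}{2}\bigl(w_i^* + w_j^*\bigr).
\end{align*}
```

The second line factors $\mu$ out of both the inverse and the bracket, so every term involving the data is scaled by $1/\mu$ and vanishes in the limit.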

We now present our method for solving Program (2). Standard
convex solvers can be employed in the setting when the number of
sources and the dimensionality of the problem are small. However,
as the number of sources $r$ and the number of dimensions $p$ both
grow, we must rely on specialized methods that can overcome the
potential computational challenges for solving such minimization
programs. Before proceeding we define the loss function to be

$$\mathcal{L}(w_0, w_i) = \frac{1}{n} \sum_{k=1}^{n} \left( y_k - \langle w_0 + \tfrac{1}{2}(w_{i(k)} + w_{j(k)}),\, x_k \rangle \right)^2$$

and the regularization terms to be

$$\mathcal{R}(w_i) = \lambda \sum_{i=1}^{r} \|w_i\|_1 .$$

3.3.2 Solution via Composite Gradient Methods 

We now develop a simple algorithm for solving our above convex
program based on composite gradient descent methods for optimizing
composite objective minimization problems [24]. The method
is an iterative algorithm that updates the estimates at each time step
based on the current gradients. If we take $w_i(s)$ to be the iterate at
the $s$th iteration of the algorithm, then for a step size $\alpha > 0$ we have that

$$w_0(s+1) := w_0(s) - \alpha \nabla_{w_0} \mathcal{L}(w_0(s), w_i(s))$$

$$w_i(s+1) := S_{\alpha\lambda}\left( w_i(s) - \alpha \nabla_{w_i} \mathcal{L}(w_0(s), w_i(s)) \right) ,$$

where $S_\tau(v)$ is the soft-thresholding operator defined on vector $v$,
with parameter $\tau$, as

$$S_\tau(v) = \mathrm{sign}(v) \max(|v| - \tau, 0) .$$

Here $\max(|v| - \tau, 0)$ is the vector of the same dimension as $v$
whose $i$th entry is the maximum of $|v_i| - \tau$ and zero.
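
The update rule can be sketched in plain Python as below. This is a minimal illustration and not the authors' R implementation: the function names, the fixed step size `alpha`, the fixed iteration count, the 1/n loss normalization and the toy data are all assumptions. Each example is a tuple `(x, i, j, y)`, and the pairwise weights are taken to be the shared vector plus the average of the two source vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_threshold(v, tau):
    """S_tau(v): shrink each entry of v toward zero by tau, clipping at zero."""
    return [math.copysign(max(abs(x) - tau, 0.0), x) for x in v]


def pair_weights(w0, w_src, i, j):
    """w_ij = w0 + (w_i + w_j) / 2."""
    return [a + 0.5 * (b + c) for a, b, c in zip(w0, w_src[i], w_src[j])]


def loss(examples, w0, w_src):
    """Mean squared error of the linear scores (1/n normalization assumed)."""
    return sum((y - dot(pair_weights(w0, w_src, i, j), x)) ** 2
               for x, i, j, y in examples) / len(examples)


def composite_gradient_descent(examples, sources, p, lam, alpha, steps):
    """Gradient step on w0; gradient step plus soft-thresholding on each w_i."""
    n = len(examples)
    w0 = [0.0] * p
    w_src = {s: [0.0] * p for s in sources}
    for _ in range(steps):
        g0 = [0.0] * p
        g_src = {s: [0.0] * p for s in sources}
        for x, i, j, y in examples:
            resid = y - dot(pair_weights(w0, w_src, i, j), x)
            for d in range(p):
                g0[d] -= 2.0 * resid * x[d] / n
                g_src[i][d] -= resid * x[d] / n  # factor 2 * 1/2 from the model
                g_src[j][d] -= resid * x[d] / n
        w0 = [w - alpha * g for w, g in zip(w0, g0)]
        w_src = {
            s: soft_threshold(
                [w - alpha * g for w, g in zip(w_src[s], g_src[s])], alpha * lam)
            for s in sources
        }
    return w0, w_src


# Toy examples (x, i, j, y) over three sources; values are illustrative only.
examples = [([1.0, 0.0], 1, 2, 1.0), ([-1.0, 0.0], 1, 2, -1.0),
            ([0.0, 1.0], 1, 3, 1.0), ([0.0, -1.0], 2, 3, -1.0)]
w0, w_src = composite_gradient_descent(examples, sources=[1, 2, 3], p=2,
                                        lam=0.01, alpha=0.1, steps=300)
```

A fixed step size is used only for simplicity; in practice it would need to be small enough relative to the curvature of the loss, or chosen by a line search.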

Intuitively, we update the vectors in the direction that will decrease
the loss the most. In the case of $w_i$, we must also account
for the $\ell_1$ regularization terms, which result in truncation operations
after taking the gradient step. The addition of an $\ell_1$ regularization
term makes the optimization procedure non-smooth. While
we could employ second-order methods [6] for solving this problem,
those would be intractable for problems in higher dimensions.
Hence, we rely on first-order gradient-based methods for computational
tractability. The down-side of applying a gradient-based
method is that optimizing general non-smooth functions can become
prohibitively slow. However, it has recently been shown that
optimizing based on composite objective methods will result in iterates
that converge geometrically fast [1].

4. EXPERIMENTS 

We next discuss experiments for verifying the behavior of our
transfer learning algorithm and comparing it against the
state-of-the-art. The results presented in Section 5 demonstrate significant
gains on real-world movie matching and synthetic datasets, showing
that TRANSFER can achieve strong performance with low
labeling cost that scales only linearly with the number of sources.

4.1 Baseline Approaches 

We consider three approaches representing the spectrum of state- 
of-the-art in ER: pairwise and pooled linear classifiers (which as we 
argue are actually special cases of transfer learning), and support 
vector machines (a non-linear learner popular in ER). 

Single. The first model, called the SINGLE or Pooled method,
simply assumes that $w_{ij} = w_0$ for all pairs of sources $i, j$, i.e., by
constraining all the $w_i = \Delta_{ij} = 0$ in TRANSFER. We have pooled
all of the tasks into a single base task, essentially imposing maximum
transfer between each task. In this setting, we are effectively
required to only estimate $p + 1$ parameters, which can be done very
effectively with order $p$ training examples [23]. Hence, we have
greatly reduced the model complexity of the problem at the expense
of ignoring any of the unique behavior of individual sources.

Pairwise Independent. At the other extreme, the method
PAIRWISEINDEPENDENT considers the situation where all normal
vectors $w_{ij}$ are learned without any shared characteristics. This
setting has no transfer as we make no assumption as to the structure
between the tasks of learning pairwise scoring functions. The
Table 1: The number of entities per crawled movie data source.



model complexity is prohibitively large since the number of parameters
to estimate scales as $r^2 p$. Hence, we would require order $r^2 p$
training examples just to learn each of the classifiers. However,
under heterogeneous sources, with enough data, this approach should
achieve far superior accuracy over SINGLE.
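
The difference in scaling between the approaches can be made concrete with a rough parameter count. The sketch below is illustrative only: the function name is hypothetical, bias terms are ignored, source pairs are counted as unordered, and the count for the transfer model assumes one shared vector plus one vector per source.

```python
def parameter_counts(r, p):
    """Rough number of weights to estimate for r sources and p feature scores."""
    pairs = r * (r - 1) // 2  # unordered source pairs
    return {
        "single": p,                         # one pooled weight vector
        "pairwise_independent": pairs * p,   # one vector per source pair
        "transfer": (r + 1) * p,             # shared vector plus one per source
    }


counts = parameter_counts(r=6, p=5)
```

For six sources and five feature scores, the pairwise-independent count is already more than twice that of the transfer model, and the gap widens quadratically as sources are added.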

Non-linear. Our third baseline model, denoted NONLINEAR,
involves learning a non-linear function. Unlike the above linear
models that take a weighted sum of the pairwise attributes, the
non-linear learner will return an arbitrary function on the features.
So that NONLINEAR may vary its output depending on the
originating sources, it is natural to encode the source pair identities
in addition to the feature scores: the feature vector presented
consists of the feature scores $g(e_i, e_j)$ together with the source pair
$(i, j)$ encoded into a length-$r$
vector that is all zeros except a 1 in the $i$th and $j$th components
(since source ordering is irrelevant). In this way NONLINEAR has
the flexibility of modeling sources individually, while implicitly
transferring patterns learned between sources. For our experiments
we take NONLINEAR to be the support vector machine (SVM) with
Gaussian kernel, which corresponds to the most widely used and
flexible feature mapping. This SVM takes in two parameters: the
cost parameter $C > 0$ and the kernel variance $\sigma > 0$.
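
The source-pair encoding described above can be sketched as follows. The function name is hypothetical and sources are assumed to be indexed from 1 to r.

```python
def encode_with_sources(scores, i, j, r):
    """Append a length-r indicator with ones at positions i and j (1-indexed)."""
    indicator = [0.0] * r
    indicator[i - 1] = 1.0
    indicator[j - 1] = 1.0
    return list(scores) + indicator


x = encode_with_sources([0.5, 2.0], i=1, j=3, r=4)
```

Because both positions are simply set to one, swapping the two source indices yields the same vector, which matches the remark that source ordering is irrelevant.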

Remark 3. SINGLE and PAIRWISEINDEPENDENT are both
special (extreme) cases of TRANSFER and are also of independent
interest since together they represent one kind of state-of-the-art
technique, linear classification, that has enjoyed success in
ER [17, 25, 28, 35]. With adaptively selected parameters (via
cross validation), we expect TRANSFER to find a balance between
the label-economical but inflexible SINGLE and the flexible but
label-hungry PAIRWISEINDEPENDENT.

SVMs are regarded as one of the most effective learners in ER
[3, 5, 10, 17, 19, 22, 28]. Like TRANSFER, the SVM can adapt its model
complexity for the problem (via its parameters). However, to take
advantage of its flexibility over linear methods more data is
typically needed. Moreover, while it may learn different classifications
for different sources (as these are encoded in the feature vector),
and indeed transfer between these tasks, the SVM does not possess
the pairwise structural knowledge that is built in to TRANSFER.

4.2 Algorithm Implementation 

For our experiments, we implement TRANSFER as described
in Section 3 in the statistical computing environment R. We
implement the baseline SINGLE and PAIRWISEINDEPENDENT algorithms
based on the more general TRANSFER implementation;
however, to speed up the baseline algorithms in line with fast
state-of-the-art implementations (for fair and representative comparisons
in our timing experiments) we exploit standard computational tricks
not available for the general transfer learning case.

We use the R e1071 package's SVM routines, which are a wrapper
for the popular LIBSVM library, for implementing NONLINEAR.
We employ 10-fold cross-validation for selecting optimal
SVM parameters $(C, \sigma)$ over a grid of candidates as is standard.

4.3 Evaluation 

In order to investigate the statistical performance of the methods'
scores, we adopt a common threshold algorithm: we declare that
two entities $e_1$ and $e_2$ are a match if their score is above threshold $\tau$,
which we vary to produce a set of potential classifiers. Therefore,
given a scoring function $f$ and a set of examples $\{(x_k, y_k)\}$, we
aim to compare the true labels $y_k$ against the estimated labels

$$\hat{y}_k = \mathrm{sign}(f(x_k) - \tau) .$$

We evaluate the performance of a classifier's predictions through
precision and recall, defined in the usual way as follows.

Definition 4. Let

$TP := \{k \,|\, \hat{y}_k = y_k = 1\}$ be the set of true positives,

$FP := \{k \,|\, \hat{y}_k = 1 \text{ and } y_k = 0\}$ be the set of false positives,

$FN := \{k \,|\, \hat{y}_k = 0 \text{ and } y_k = 1\}$ be the set of false negatives.

We define precision $P$ and recall $R$ in terms of these sets:

$$P := \frac{|TP|}{|TP| + |FP|} , \qquad R := \frac{|TP|}{|TP| + |FN|} .$$

Hence, for varying threshold $\tau$, we have a range of different $P$ and
$R$ values that together form a precision-recall curve. We also
measure test error as follows, which combines both false positives
and negatives.

$$E := \frac{|FP| + |FN|}{m}, \qquad (3)$$

where $m$ is the total number of test examples.
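The thresholding rule and Definition 4 can be sketched in a few lines of code. This is a minimal illustration rather than the R implementation used in our experiments: the function names are hypothetical, labels are assumed to be coded as 1 (match) and 0 (non-match), test error is assumed to be the fraction of misclassified pairs, and returning 1.0 for an undefined ratio is merely a convention chosen here.

```python
def predict(scores, tau):
    """Declare a match (1) when the score is above the threshold tau, else 0."""
    return [1 if s > tau else 0 for s in scores]


def precision_recall_error(y_true, y_pred):
    """Return (precision, recall, test error) for 0/1 labels."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if p == 1 and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if p == 1 and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_pred) if p == 0 and y == 1)
    # Assumed convention: an undefined ratio (zero denominator) is reported as 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    # Assumed form of the test error: fraction of misclassified pairs.
    error = (fp + fn) / len(y_true)
    return precision, recall, error


def pr_curve(y_true, scores):
    """Sweep tau over the distinct scores, collecting (tau, P, R) triples."""
    curve = []
    for tau in sorted(set(scores)):
        p, r, _ = precision_recall_error(y_true, predict(scores, tau))
        curve.append((tau, p, r))
    return curve
```

Calling `pr_curve(labels, scores)` on held-out pairs yields the points from which a precision-recall curve can be drawn.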

4.4 Datasets and Pre-Processing 

We employed two large-scale datasets in our experiments. 

Real-World Movie Data. Six major online movie sources were
crawled for use in the Bing movies vertical. The numbers of records
obtained are given in Table [1]. For each movie we obtained its title
and alternate titles, release year, runtime, cast, and directors. From 
these attributes we performed basic string cleanup and blocked on 
common (non-stop) words in the titles. Each raw feature produced 
one feature score: Jaccard for titles, directors and cast; and abso- 
lute difference for runtime and year. Humans labeled 200 entity 
pairs across each source pair. In our following experimental re- 
sults on this movie data, we learn the scoring functions on vari- 
ous sources (as specified) but evaluate precision and recall against 
movies from the pair IMDB and iTunes. This choice was made in 
order to demonstrate the behavior across a specific pair rather than 
averaging across all available pairs. We held out a subset of the 
movie data as the test set. We then used the remainder for training 
the methods. In order to improve the conditioning of the problem, 
as is standard in machine learning, we standardized the data by 
subtracting feature score means and dividing by standard deviation, 
making the features zero mean and unit variance and so placing the 
features on equal footing. 
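The feature construction and standardization steps described above might look as follows. This is a sketch under stated assumptions, not our actual pipeline: the record field names are hypothetical, titles are compared on lower-cased whitespace tokens (the blocking and stop-word handling are omitted), the Jaccard score of two empty sets is taken to be 0, and the population standard deviation is used.

```python
import statistics


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def feature_vector(rec1, rec2):
    """One feature score per raw attribute (field names are hypothetical)."""
    return [
        jaccard(rec1["title"].lower().split(), rec2["title"].lower().split()),
        jaccard(rec1["directors"], rec2["directors"]),
        jaccard(rec1["cast"], rec2["cast"]),
        abs(rec1["runtime"] - rec2["runtime"]),
        abs(rec1["year"] - rec2["year"]),
    ]


def standardize(rows):
    """Subtract each column's mean and divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # A constant column has zero deviation; divide by 1.0 to leave it at zero.
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]
```

In practice the means and deviations would be estimated on the training pairs and then reused to transform the test pairs.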

Synthetic Data. We synthesized raw true attributes for each un- 
derlying latent entity uniformly at random in a unit interval. Then 
each record representing an entity in a source was produced by per- 
turbing each of the attributes randomly with low-variance Gaussian 
noise. Feature-level scores were then produced using a simple
difference between the attribute values of pairs of entities. It is
important to note that perturbing the feature-level scores directly
would be an incorrect methodology, since the scores would then not
obey any kind of triangle-inequality-like property, as is the case
for "real" ER problems. We produced up to 30 synthetic sources to
stress test the approaches, and used 10k test pairs in total.
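The generative process just described can be sketched as follows. The function names, the per-source noise levels, and the use of an absolute (rather than signed) difference are assumptions made for illustration; the sketch perturbs the attributes, not the scores, as required above.

```python
import random


def make_synthetic(num_entities, num_attrs, noise_sds, seed=0):
    """Draw latent entities, then one noisy record per entity per source.

    noise_sds holds one (assumed) Gaussian noise level per source.
    """
    rng = random.Random(seed)
    # True attributes: uniform at random in the unit interval.
    truth = [[rng.random() for _ in range(num_attrs)]
             for _ in range(num_entities)]
    # Each source perturbs every attribute of every entity independently.
    sources = [
        [[a + rng.gauss(0.0, sd) for a in entity] for entity in truth]
        for sd in noise_sds
    ]
    return truth, sources


def pair_features(rec_a, rec_b):
    """Feature-level scores: attribute-wise absolute difference (assumed)."""
    return [abs(a - b) for a, b in zip(rec_a, rec_b)]
```

A matching pair is then two records of the same latent entity drawn from different sources, for example `pair_features(sources[0][k], sources[1][k])`.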

5. RESULTS 

We now present the results of our experiments, starting on the
movie data. These results are presented by comparing the PR curves
of TRANSFER and the three baseline methods. We then focus attention
on the synthetic data in order to gain a deeper understanding of the
behavior of the transfer method. Our results conclusively
demonstrate that TRANSFER requires significantly less labeling while
achieving superior accuracy over state-of-the-art approaches in
multiple-source ER.

Figure 3: Precision achieved by each method on the 6 movie
sources, at 0.85 recall, for varying number of examples.

5.1 Precision Recall Curves 

With the datasets selected, we applied the four learning algorithms
to the training set, producing pairwise scoring functions $f_{ij}$.
We then applied the functions to the unobserved test data and built
PR curves by varying the threshold parameter $\tau$. Figure [4]
presents a three-by-three grid of PR curves. Each plot shows the
effect of varying the number of training examples on precision and
recall. Figure [3] summarizes these nine plots by fixing recall at
0.85 (matching for a movie vertical requires high precision at the
expense of lower recall) and visualizing achieved precision against
the total number of training examples.
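One way to read off precision at a fixed recall from a set of scored test pairs is sketched below. The function name is hypothetical, and the rule of taking the best precision among all thresholds that reach the target recall is an assumption; other conventions, such as interpolating along the curve, are equally plausible.

```python
def precision_at_recall(y_true, scores, target_recall=0.85):
    """Best precision over thresholds whose recall is at least target_recall."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("need at least one positive example")
    best = None
    # -inf declares every pair a match; each distinct score is a further cut.
    for tau in [float("-inf")] + sorted(set(scores)):
        predicted = [s > tau for s in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(y_true, predicted) if p and y == 0)
        if tp + fp == 0:
            continue  # no pair is declared a match at this threshold
        recall = tp / positives
        precision = tp / (tp + fp)
        if recall >= target_recall and (best is None or precision > best):
            best = precision
    return best
```

Repeating this for models trained on increasing numbers of examples gives one precision value per training-set size, which is the kind of summary plotted against the number of examples.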

Consider the summary Figure [3]. As expected, SINGLE performs
relatively well when very little training data is available, but
does not gain much from additional training data, and is inferior to
the other methods, owing to it not modeling the unique data
characteristics of each source. PAIRWISEINDEPENDENT behaves in the
opposite manner to SINGLE: it is not able to fit its many-parameter
models when little data is available, but progressively improves as
more data becomes available. TRANSFER adaptively combines the best
of both linear baseline models, and dominates all three
state-of-the-art baselines at 0.85 recall. While NONLINEAR tracks
the performance of TRANSFER, it is not endowed with the correct
pairwise task structure leveraged by TRANSFER, and so its precision
is significantly shifted down.

Similar patterns are borne out in the complete PR curves of
Figure [4], which are also endowed with 95% confidence bands. In its
first, second and third rows we vary the number of available
examples per source pair, per source, and in total, respectively.
The same trends observed in Figure [3] are apparent here; however,
no one method tends to dominate another for all recall values.

Figure 4: Precision-recall curves comparing the four learning algorithms on the 6 sources of movie matching data; bands show 95%
pointwise confidence intervals. Figures in the top/middle/bottom row show methods trained on numbers of examples per source
pair/source/total; and figures in columns see increasing training data going from left to right.

Figure 5: Sample complexity of the three linear learning algorithms
on synthetic data.

Figure 6: Source complexity of the three linear learning algorithms
on synthetic data.

In the final row and column, PAIRWISEINDEPENDENT catches up with
TRANSFER, because the two sources that were picked to construct the
PR curves are themselves large, so that a large number of training
examples were assigned to that specific pair. Another interesting
observation is that NONLINEAR varies significantly compared to the
other methods (apparent from the width of the confidence bands).
Such behavior is to be expected, as the model complexity of
NONLINEAR is the greatest. Due to its poor performance on the movie
data, we do not present results for the NONLINEAR SVM on the
synthetic data, where we focus on an apples-to-apples comparison of
the three linear learners with varying amounts of transfer.

5.2 Sample Complexity 

Our next experiment takes a deeper look at the effect of the number
of examples per source pair on the average error defined in
Equation (3). These experiments were performed using synthetic data
to give us finer control over the data generating process, and thus
to concretely explore how increasing the number of examples affects
E. We observe from Figure [5] that as the number of examples
increases, the errors of TRANSFER and PAIRWISEINDEPENDENT both
decrease, while the error of SINGLE remains bounded from below. This
result is owing to the fact that SINGLE cannot take into account the
individual differences between the sources that we are experimenting
with, while the other more flexible methods can. However, even
though PAIRWISEINDEPENDENT does have that freedom, we see that
TRANSFER still performs better because it is a "simpler" model to
learn.

5.3 Source Complexity 

We now present the results of one of our most telling synthetic
experiments. Figure [6] shows the results of increasing the number
of synthetic sources from 2 to 30 in increments of 2 sources. As we
increase the number of sources we add a constant number of training
examples: that is, we impose the desirable linear, rather than the
intractable quadratic, scaling of labeling cost. Again we compare
the three linear methods, along with their confidence bands based on
50 trials. We observe that TRANSFER achieves far superior results,
actually improving slightly with the number of sources. On the other
hand, we see that PAIRWISEINDEPENDENT, unsurprisingly, performs
poorly as the number of sources increases, as it requires quadratic
scaling of the training data: the algorithm no longer has sufficient
examples available to train on the order of 450 scoring functions.
We also note that even though SINGLE is very "simple", and hence
does not require a significant number of training examples, it is
unable to adapt to the fact that the sources have varying behavior.
Hence, we observe its error increase as the SINGLE method is no
longer able to model the behavior of the observed examples.
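The arithmetic behind this scaling argument is easy to check. With n sources there are n(n-1)/2 source pairs, so 30 sources give 435 pairwise scoring functions. The sketch below uses hypothetical function names and a hypothetical budget of 20 labeled examples per source to show how thinly a linear labeling budget is spread over the pairs.

```python
from math import comb


def num_source_pairs(num_sources):
    """Number of unordered source pairs, n choose 2."""
    return comb(num_sources, 2)


def examples_per_pair(num_sources, examples_per_source):
    """Average labels available per pair under a linear labeling budget."""
    total = num_sources * examples_per_source
    return total / num_source_pairs(num_sources)


if __name__ == "__main__":
    for n in (2, 10, 30):
        # 20 labels per source is an assumed budget, for illustration only.
        print(n, num_source_pairs(n), round(examples_per_pair(n, 20), 2))
```

Under this assumed budget, the labels available per pair fall from 40 at 2 sources to fewer than 2 at 30 sources, which is why a learner that fits each pair independently degrades as sources are added.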

5.4 Runtime Analysis 

Figure [7] shows the results of a timing analysis of the three
linear methods. In the left figure we show the results with 20
examples per source pair, and in the right we show 100 examples per
source pair, both on 10 synthetic sources. In both cases TRANSFER
quickly achieves superior test error and converges relatively fast.
The baseline linear methods converge faster but to inferior test
error. In the small training set case SINGLE outperforms
PAIRWISEINDEPENDENT and, as expected, as the number of training
examples increases to 100 examples per source pair,
PAIRWISEINDEPENDENT overtakes SINGLE and performs comparably to
TRANSFER.

6. RELATED WORK 

We now discuss prior work in ER as it relates to this paper, from
the viewpoints of ER across sources of varying quality, and of
recognizing (and mitigating) the cost of generating labeled data.
This paper is one of the first to consider ER on multiple sources of
varying quality, is the first to highlight the cost of labeling as a
barrier to scaling accurate ER to multiple sources, and is the first
to apply the transfer learning paradigm to ER (and along the way we
develop a transfer learning algorithm that is novel to the machine
learning community).

[Figure 7 panels: "Runtimes on Synthetic Data: 20 Examples per Source Pair" (left) and "Runtimes on Synthetic Data: 100 Examples per Source Pair" (right); horizontal axis: Runtime (seconds); legend: Transfer, Single, Pairwise indep.]

Figure 7: Runtime analysis of the three learning algorithms on (left) 20 (right) 100 examples per synthetic source pair. Across the
board transfer yields superior performance after only a short time. Pairwise independent surpasses pooled once it is given around
40 examples per source pair.

Varying Source Quality. Numerous works have studied ER, many of
which involve matching across multiple sources, but very few have
explored the serious challenge of resolving sources with varying
data quality. Traditionally researchers use a single
non-learning-based matcher or model a single learner on pooled
training data [29]. Shen et al. posited that real-world data sources
have varying levels of semantic ambiguity, a special case of source
quality in which an individual attribute should be given more or
less weight. They propose the SOCCER framework for compiling
matching execution plans in which one of two matchers can be
employed on different sources: a relaxed (conservative) matcher
requiring less (respectively more) evidence to declare a match.
These matchers could be related via the relaxation of a threshold,
for example. Similar to [29], the adverse effect of heterogeneous
source quality on matching accuracy is a key motivation of this
paper. By contrast, however, our paper sets out to learn continuous
characterizations of quality via real-valued weights, over all
attributes together. Moreover, we leverage a significantly more
fine-grained transfer structure between the tasks of matching
different source pairs, compared to the authors' process of simply
pooling like tasks. That process is more akin to the straw man
SINGLE learner used for comparison here, which does not correctly
balance transfer with the needs of matching tasks.

Köpcke & Rahm developed the STEM framework for investigating the
effect of training selection on learning to match [17]. While they
do not compare training on heterogeneous sources independently in a
pairwise fashion versus together with a single matcher, they
implicitly acknowledge the need to produce different matchers
tailored to source pairs' characteristics, as they split their most
challenging experimental matching task of resolving publications
between three sources (Google Scholar, which is larger but of lower
quality, and DBLP and the ACM Digital Library, which are both of
higher quality but smaller) into independent learning problems
between DBLP and the other two sources.

The Cost of Labeling. The key challenge solved by our approach is
to significantly reduce the human effort required to label training
data for learning to combine matchers over multiple domains. In
their thorough comparative evaluations of 21 ER systems with their
FEVER framework [19], Köpcke et al. explicitly identified human
effort as a key metric for the effectiveness of learning-based ER
systems. In their line of work, this desire can be traced to their
earlier paper on STEM [17]. In both works the authors pay particular
attention to the effect of training-set size on the quality of
matching, and favor methods requiring less labeled data such as the
SVM. Unlike this paper, however, they do not consider matching
across multiple sources and the additional requirement this can
place on human labeling.

While low sample complexity has been identified as a desirable 
property of entity matchers, the notion of labeling cost has not been 
previously recognized as a barrier to scaling ER. We introduce the 
notion of source complexity, which characterizes the change in error
as new sources are added, provided the number of training examples
is increased only linearly with the number of sources.

7. CONCLUSIONS 

Many problems in databases, statistics and machine learning require
learning of a pairwise similarity function from human-labeled
examples. However, as the number of data sources increases, the
sample complexity (the cost of human labeling) increases
quadratically. To overcome this prohibitive scaling, we propose a
new transfer learning algorithm TRANSFER for learning multiple
similarity score functions jointly. We take ER as a motivating
example, and present extensive experimental comparisons of TRANSFER
against existing state-of-the-art methods. Our experiments, on
real-world, large-scale movie matching data and on extensive
synthetic data, show that TRANSFER indeed produces more accurate
results for ER than existing methods, with less data, and indeed in
faster time. Interesting future work might consider combining active
learning with TRANSFER, and extending TRANSFER to nonlinear
classification.